Analysis of Relationship Between Sales and Income Data#
In this NoteBook, we will be analyzing the relationship between sales of tobacco and median income in each state in the years 2013, 2014, 2016, 2017, 2018, and 2019. We omitted the year 2015 because the original data source U.S._Chronic_Disease_Indicators__Tobacco.csv lacks the average sales data we need for analysis. We first restructured the average sales and the median income data into our desired data frame and displayed them. We then plot 6 choropleth graphs for average sales in each state in the years of 2013 to 2019 (excluding 2015), and 7 choropleth graphs for median income in each state in the years of 2013 to 2019. We put them into the outputs folder of the project. Finally, we compared the graphs that we generated, analyze them, and drew a conclusion from the graphs.
import pandas as pd
import plotly.express as px
Loading Data#
Average Sales Data#
# avg sales data
sales_DATA = pd.read_csv("data/U.S._Chronic_Disease_Indicators__Tobacco.csv", low_memory=False)
# sales for each state in years from 2013 to 2019
# filter question
states_sales = sales_DATA[sales_DATA["Question"] == "Sale of cigarette packs"]
# filter columns
states_sales = states_sales[['YearStart', 'LocationAbbr', 'LocationDesc', 'Question', 'DataValueUnit', 'DataValue']]
states_sales = states_sales.groupby(['YearStart', 'LocationDesc']).sum().reset_index()
# remove states GU, PR, US, VI since they are not actual states.
states_sales = states_sales.loc[states_sales['DataValue'] != 0.0]
states_sales.to_csv("data/states_sales.csv")
# change format of data
sub = states_sales.groupby(['LocationDesc', 'LocationAbbr', 'DataValueUnit']).apply(lambda row: pd.Series({"2013": row.DataValue[row.YearStart == 2013].values[0],
"2014": row.DataValue[row.YearStart == 2014].values[0],
# ignoring 2015 because missing data in the original source
"2016": row.DataValue[row.YearStart == 2016].values[0],
"2017": row.DataValue[row.YearStart == 2017].values[0],
"2018": row.DataValue[row.YearStart == 2018].values[0],
"2019": row.DataValue[row.YearStart == 2019].values[0]}))
sub =sub.reset_index(['LocationDesc', 'LocationAbbr', 'DataValueUnit'], drop=False)
sub.head()
| LocationDesc | LocationAbbr | DataValueUnit | 2013 | 2014 | 2016 | 2017 | 2018 | 2019 | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Alabama | AL | pack sales per capita | 64.6 | 61.7 | 60.1 | 58.1 | 54.7 | 53.1 |
| 1 | Alaska | AK | pack sales per capita | 39 | 37.2 | 35.6 | 32.9 | 31.1 | 30.4 |
| 2 | Arizona | AZ | pack sales per capita | 24.4 | 23 | 23.1 | 22.5 | 21.8 | 20.7 |
| 3 | Arkansas | AR | pack sales per capita | 57.5 | 54.4 | 54.2 | 52.4 | 50.2 | 47.6 |
| 4 | California | CA | pack sales per capita | 23.9 | 22.7 | 22 | 20.5 | 16.6 | 15.8 |
Median Income Data#
# median income data
median_DATA = pd.read_csv('data/median_income.csv')
cols_drop = median_DATA.columns[median_DATA.columns.str.contains("Standard")]
median_income = median_DATA.drop(columns=cols_drop).set_index("Location")
#drop 2017(40) and 2013(39)
median_income = median_income.drop(median_income.columns[[0, 1, 4, 9]], axis = 1)
us_state_to_abbrev = {
"Alabama": "AL",
"Alaska": "AK",
"Arizona": "AZ",
"Arkansas": "AR",
"California": "CA",
"Colorado": "CO",
"Connecticut": "CT",
"District of Columbia": "DC",
"Delaware": "DE",
"Florida": "FL",
"Georgia": "GA",
"Hawaii": "HI",
"Idaho": "ID",
"Illinois": "IL",
"Indiana": "IN",
"Iowa": "IA",
"Kansas": "KS",
"Kentucky": "KY",
"Louisiana": "LA",
"Maine": "ME",
"Maryland": "MD",
"Massachusetts": "MA",
"Michigan": "MI",
"Minnesota": "MN",
"Mississippi": "MS",
"Missouri": "MO",
"Montana": "MT",
"Nebraska": "NE",
"Nevada": "NV",
"New Hampshire": "NH",
"New Jersey": "NJ",
"New Mexico": "NM",
"New York": "NY",
"North Carolina": "NC",
"North Dakota": "ND",
"Ohio": "OH",
"Oklahoma": "OK",
"Oregon": "OR",
"Pennsylvania": "PA",
"Rhode Island": "RI",
"South Carolina": "SC",
"South Dakota": "SD",
"Tennessee": "TN",
"Texas": "TX",
"Utah": "UT",
"Vermont": "VT",
"Virginia": "VA",
"Washington": "WA",
"West Virginia": "WV",
"Wisconsin": "WI",
"Wyoming": "WY"
}
median_income = median_income.reset_index().copy()
median_income.iloc[:, 1:len(median_income)] = median_income.iloc[:, 1:len(median_income)].replace(',','', regex=True).astype(float)
median_income["avg_med_income"] = median_income.iloc[:, 1:len(median_income)].mean(axis = 1)
#Abbreviate the States for plotly geographical mapping
median_income["state"] = median_income["Location"]
median_income = median_income.replace({"state": us_state_to_abbrev}).drop(0)
# median_income.columns.values[7] = '2013 Median income'
median_income = median_income.rename(columns={median_income.columns.tolist()[7]:'2013 Median income'})
median_income.head()
| Location | 2019 Median income | 2018 Median income | 2017 Median income | 2016 Median income | 2015 Median income | 2014 Median income | 2013 Median income | avg_med_income | state | |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Alabama | 56200.0 | 49936.0 | 51113.0 | 47221.0 | 44509.0 | 42278.0 | 41381.0 | 47519.714286 | AL |
| 2 | Alaska | 78394.0 | 68734.0 | 72231.0 | 75723.0 | 75112.0 | 67629.0 | 61137.0 | 71280.000000 | AK |
| 3 | Arizona | 70674.0 | 62283.0 | 61125.0 | 57100.0 | 52248.0 | 49254.0 | 50602.0 | 57612.285714 | AZ |
| 4 | Arkansas | 54539.0 | 49781.0 | 48829.0 | 45907.0 | 42798.0 | 44922.0 | 39919.0 | 46670.714286 | AR |
| 5 | California | 78105.0 | 70489.0 | 69759.0 | 66637.0 | 63636.0 | 60487.0 | 57528.0 | 66663.000000 | CA |
Map plot for avg sales per state from 2013 - 2019 (6 graphs; 2015 omitted)#
# avg sales (sales per capita) for each year (6 in total)
# Year 2015 omitted
Years = [2013, 2014, 2016, 2017, 2018, 2019]
year = 0
# plotting 6 graphs in 2013, 2014, 2016, 2017, 2018, and 2019
for index in range(3, 9):
ddf = sub.iloc[:, index].astype(float)
fig = px.choropleth(sub,
locations="LocationAbbr",
hover_name="LocationDesc",
color= ddf.values,
locationmode='USA-states',
color_continuous_scale='greens',
range_color = (80000, 40000))
fig.update_layout(title_text = f"Sales per capita in {Years[year]}", geo_scope='usa')
fig.write_html(f"output/avgSales{Years[year]}.html")
year += 1
fig.show()
Map plot for median income from 2013 - 2019 (7 graphs)#
# median income graph for each year (6 in total)
Years1 = [2013, 2014, 2015, 2016, 2017, 2018, 2019]
year = 0
# plotting 7 graphs in 2013 to 2019
for index in reversed(range(1, 8)):
ddf = median_income.iloc[:, index].astype(float)
fig = px.choropleth(median_income,
locations="state",
hover_name="Location",
color=ddf.values,
locationmode='USA-states',
color_continuous_scale='reds',
range_color = (80000, 40000))
fig.update_layout(title_text = f"Median Income in {Years1[year]}", geo_scope='usa')
fig.write_html(f"output/medianIncome{Years1[year]}.html")
year += 1
fig.show()
Analysis#
As we can see from the 6 graphs of average sales, the allocation of which states have the most tobacco pack sales per capita hasn’t changed much, with the state of New Hampshire, Kentucky, West Virginia, and Missouri having relatively the most packs sold in the years between 2013 and 2019 (excluding 2015). However, in the Median income graphs we see that the states with the highest median income are usually New Hampshire, Massachusetts, Connecticut, New Jersey, and Maryland. Just by looking at these data, it seems like there’s no direct relationship between the average amount of pack sold and median income in that state. A higher median income of a state doesn’t necessarily correlate with more packs of tobacco sold, and same with the opposite. We can see that the states with least amount of packs sold (New York, Utah, Washington, and California) are not the ones with the lowest median income (Mississippi, New Mexico, Louisiana, Arkansas, West Virginia, etc). Therefore, we failed to find any significant or meaningful relationship between the sales and income in each state just by visualizing the graphs and understanding which states has higher sales per capita and which states has higher median income. And thus we went one step further and conducted a regression analysis on the two data in the next Notebook.
## put them side by side
from plotly.subplots import make_subplots
import plotly.graph_objects as go
fig = make_subplots(
rows=1, cols=2, subplot_titles=['income', 'sales'],
specs=[[{"type": "choropleth"}, {"type": "choropleth"}]])
fig_1 = px.choropleth(median_income,
locations="state",
hover_name="Location",
color="2013 Median income",
locationmode='USA-states',
color_continuous_scale='reds',
range_color = (80000, 40000))
fig_1.update_layout(title_text = "Median Income in 2013", geo_scope='usa')
fig_2 = px.choropleth(sub, locations="LocationAbbr",
hover_name="LocationDesc",
color="2013",
locationmode='USA-states',
color_continuous_scale='greens',
range_color = (80000, 40000))
fig_2.update_layout(title_text = "avg sales in 2013", geo_scope='usa')
fig.add_trace(fig_1['data'][0], row=1, col=1)
fig.add_trace(fig_2['data'][0], row=1, col=2)
fig.show()
from IPython.display import HTML
# HTML(filename='output/medianIncome2013.html')
# HTML(filename='output/medianIncome2014.html')
def putHTML(year, avg_sales):
'''Takes in a year & whether the user wants average sales or median income data, and display the html'''
if avg_sales:
return HTML(filename = f'output/avgSales{year}.html')
else:
return HTML(filename= f'output/medianIncome{year}.html')
putHTML(year = 2013, avg_sales = True)